The idea for the project came from the question: “What exactly are the impacts of coronavirus?” It evolved into a desire to pinpoint various sentiments in response to coronavirus and related topics. As coronavirus has plagued the world for the last two years, there has been an ongoing discussion surrounding the public’s responses highlighted on social media. With polarized opinions on precautionary measures, we were interested in the influence of political affiliation on responsive actions, such as getting vaccinated. Lower incidence rates tied to a high vaccination response would reveal vaccine effectiveness and also where vaccination campaigns should be focused. We were also curious to see whether Asians were significantly affected by coronavirus, as it was initially called the “Chinese virus” by former President Donald Trump; we wanted to determine whether people associated Asians with the virus. On the news, reports of crimes against Asians have become more prevalent since the start of the pandemic. Determining the severity and recent trends of crimes committed against Asians will illuminate how urgently legislators should take action.
With this in mind, we came up with the following questions:
What is the overall sentiment of social media posts on COVID vaccinations? Specifically, what is the overall sentiment on vaccine mandates and vaccines in general?
Are people’s views on the COVID vaccine related to their political affiliations?
Is there an association between COVID and sentiments towards Asians?
We will answer the questions proposed above in the order that they are given, using various APIs, visualization methods, and online documentation.
(Figure: sentiment analysis workflow)
For our sentiment analysis, we pulled comments from both subreddits and YouTube videos and first applied TextBlob. We focused on the polarity score that TextBlob returns, where scores closer to -1 indicate negative sentiment and scores closer to +1 indicate positive sentiment. However, we found that TextBlob was highly inaccurate at classifying comments as negative or positive, as seen in the comments below. For example, in the dataframe shown below (based on the Republican subreddit), VADER is better at identifying the actual sentiment of a comment. TextBlob essentially only counts which words appear in a comment: if it contains more negative words, the comment is scored as negative. VADER, in contrast, accounts for syntax and how the words are used, so it determines polarity with greater accuracy. The comment in the first row is clearly negative, yet TextBlob scores it as roughly neutral. To elaborate, the comment shown below was given a polarity of 0.1461 by TextBlob, i.e. slightly positive, when the comment is clearly negative; VADER evaluates it correctly with a compound score of -0.9825. Due to these inaccuracies, we applied NLTK's Valence Aware Dictionary and sEntiment Reasoner (VADER) instead.
From the compound scores that VADER returns, with negative compound scores indicating negative sentiment, we can already see that VADER performs better at determining the sentiment of comments. VADER is also convenient in that no preprocessing of the text is needed, since it is designed to classify social media comments, including their punctuation and capitalization. This is also why VADER was applied before preprocessing the comments for visualization.
When it came to visualizing comments, we only visualized the positive or negative comments, depending on the sign of the average compound score for that video or subreddit. For example, if the average was negative, we filtered for the comments with negative compound scores and then applied standard text preprocessing to visualize the words in a word cloud.
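This filtering step can be sketched in a few lines of pandas; the function name and toy data below are illustrative, not the project's actual code:

```python
import pandas as pd

def select_dominant_sentiment(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only comments whose compound sign matches the overall average.

    Assumes `df` carries a VADER `compound` column; this is a sketch of the
    filtering described above, not the project's exact implementation.
    """
    if df["compound"].mean() < 0:
        return df[df["compound"] < 0]
    return df[df["compound"] >= 0]

# Toy example: the average is negative, so only negative comments survive.
comments = pd.DataFrame({"compound": [-0.9, -0.4, 0.2, -0.7]})
dominant = select_dominant_sentiment(comments)
```

The surviving subset is then preprocessed and passed to the word cloud generator.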
TextBlob_Inaccuraries
| | comments | score | subreddit | time_created | processed comments | negative | neutral | positive | compound | polarity | subjectivity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | > You making a lot of post about people dyi... | 1.0 | Republican | 2020-07-09 02:31:41 | make lot post die lung cancer prostate cancer ... | 0.135 | 0.770 | 0.095 | -0.9851 | 0.080107 | 0.591643 |
| 1 | Santa Clara : A news story about an article th... | 1.0 | Republican | 2020-04-20 18:01:57 | santa clara news story article yet peer review... | 0.205 | 0.795 | 0.000 | -0.9825 | 0.146131 | 0.486756 |
| 2 | The problem is that “high risk” is actually a ... | 1.0 | Republican | 2020-04-23 01:16:15 | problem high risk actually pretty large popula... | 0.145 | 0.756 | 0.098 | -0.9432 | 0.124611 | 0.489219 |
| 3 | Problem is I don’t think even a tiered opening... | 1.0 | Republican | 2020-04-23 01:44:54 | problem think even tiered opening economy solu... | 0.115 | 0.846 | 0.039 | -0.9607 | 0.181812 | 0.360847 |
| 4 | This is a very long post, I apologize. I went ... | 1.0 | Republican | 2020-04-23 20:55:46 | long post apologize go back gather component p... | 0.098 | 0.843 | 0.059 | -0.9786 | 0.048263 | 0.409869 |
TextBlob_Inaccuraries["comments"].tolist()[1]
'Santa Clara : A news story about an article that has not yet been peer reviewed is not evidence\n\nMassachusetts: An observation of 200 people in a county in Massachusetts is in no way able to be extrapolated to the general population or even the population of the state. It says so in the article.\n\nLancet: Their most conservative estimate puts the fatality of COVID at around 2%\n\nFlu kills people. Even with a vaccine, it\'s fatality ratio is about 0.1%. Its R0 (Rho naught) is about 1-2.\n\nSomething with the nearly twice the infectivity and 20x the fatality ratio of the flu is serious.\n\nYou\'re right. No one is saying it is 5%. This is a classic strawman argument that people use to try and "win" against a point which is blatently false.'
To answer the question on vaccine mandates, we used YouTube's Data API to obtain the top 500 comments for videos pertinent to supporting or opposing the vaccine mandate. We chose to request only the top 500 comments, since highly popular social media platforms such as YouTube tend to have copious amounts of irrelevant comments. Obtaining all the comments would therefore introduce many unrelated comments into our dataset that could skew and invalidate our sentiment analysis. By taking the top 500 comments, we struck a balance between relevance and sample size.
When choosing the YouTube videos to apply our sentiment analysis workflow to, we first used keywords such as "support" or "block" to search for videos supporting or opposing vaccine mandates. We then inspected the description and watched each video to confirm its viewpoint. The links to these videos are given below.
YouTube videos supporting the vaccine mandate:
YouTube videos against the vaccine mandate:
The sentiment analysis workflow was applied to the videos in the order that they appear in the list above. We can see that videos supporting vaccine mandates have an overall negative sentiment, while videos against vaccine mandates have an overall positive sentiment. To get a better idea of the specific words in context, we generated word clouds of the frequently used words in the comments.
results_df
| | video_title | average_compound_score |
|---|---|---|
| 0 | Majority of workers support Biden's business C... | -0.129255 |
| 1 | ‘Shameful’: Doctor on Supreme Court’s Covid-19... | -0.118699 |
| 2 | Why You Can Be Forced To Get The Covid Vaccine | -0.053158 |
| 3 | Supreme Court Blocks Biden’s Covid Vaccine Man... | 0.047864 |
| 4 | Supreme Court blocks Biden's vaccine mandate f... | 0.027544 |
| 5 | Police officer quits over vaccine mandate: 'I ... | 0.294600 |
"Majority of workers support Biden's business Covid vaccine mandate: CNBC Survey" is clearly a video supporting the vaccine mandate. From the word cloud below, we can see the negativity of the comments. The most notable words are fake, news, liar, lying, propaganda, and media. In context, these words show that the commenters had a negative view of vaccine mandates: they used mostly negative words, such as lying, to express their distrust in the survey's results, likely because they wanted content that opposed vaccine mandates.
generate_word_cloud(vid_1_preprocess[vid_1_preprocess.compound<0],
"Majority of workers support Biden's business Covid vaccine mandate: CNBC Survey",
18)
"‘Shameful’: Doctor on Supreme Court’s Covid-19 vaccine mandate decision" is another video supporting the vaccine mandate. In this case, a doctor thought that the Supreme Court's decision to block the vaccine mandate was 'shameful' and that there should have been a vaccine mandate. It is important to note that doctors have more medical knowledge than the general public, especially when it comes to vaccinations. Despite this, the comments suggest that people disliked his viewpoint, calling his support for mandating vaccines shameful. Interestingly, adding to the negativity towards the doctor, we also prominently see the words government and propaganda, suggesting that commenters viewed the doctor as having some political affiliation. Although a doctor supporting vaccine mandates might seem like a more credible source, the commenters clearly did not see it that way.
generate_word_cloud(vid_2_preprocess[vid_2_preprocess.compound<0],
"‘Shameful’: Doctor on Supreme Court’s Covid-19 vaccine mandate decision",
20
)
"Why You Can Be Forced To Get The Covid Vaccine" is the last video we obtained that supports the vaccine mandate. Here, too, the frequently used words carried a negative context. "Forced" appeared often, generally in comments despising the idea of being forced to do something. Other notable words are wrong, stop, and constitution. In context, people viewed the vaccine mandate as wrong and justified this view by citing the Constitution, specifically the First Amendment.
generate_word_cloud(vid_3_preprocess[vid_3_preprocess.compound<0],
'Why You Can Be Forced To Get The Covid Vaccine',
25)
"Supreme Court Blocks Biden’s Covid Vaccine Mandate For Private Companies" was the first video we obtained that opposes vaccine mandates. Recall that the overall sentiment was positive, and we can indeed see many positive words. The most notable are freedom, free, and constitution. Generally, the comments containing these words viewed the Supreme Court's decision to block the vaccine mandate as justified because the mandate contradicted the First Amendment and the idea America is known for: freedom.
generate_word_cloud(vid_4_preprocess[vid_4_preprocess.compound>0],
'Supreme Court Blocks Biden’s Covid Vaccine Mandate For Private Companies',
20)
"Supreme Court blocks Biden's vaccine mandate for big businesses" is another video relevant to opposing the vaccine mandate. This video is similar to the previous one, except it focuses on big businesses. The overall sentiment was again positive, and the frequently used words in the comments were also similar. We again see many positive words, such as thank and good. Additionally, freedom, seen in the bottom right corner, was frequently used, again suggesting that the idea of freedom is used as a reason to oppose the vaccine mandate.
generate_word_cloud(vid_5_preprocess[vid_5_preprocess.compound>0],
"Supreme Court blocks Biden's vaccine mandate for big businesses",
23)
"Police officer quits over vaccine mandate: 'I decided to turn in my badge so I can speak up'" is the last video we obtained that opposes the vaccine mandate. Interestingly, this one is about a police officer who quit his job over the mandate. The overall sentiment was very positive. Many of the comments commended the police officer for his decision, as seen in words such as right, hero, brave, and proud. Freedom was also frequently used, showing that the idea of freedom is once again invoked against the vaccine mandate.
generate_word_cloud(vid_6_preprocess[vid_6_preprocess.compound>0],
"Police officer quits over vaccine mandate: 'I decided to turn in my badge so I can speak up'",
17)
Overall, we can see that videos supporting vaccine mandates had negative sentiment, while videos against mandating vaccines had positive sentiment. This implies that the overall sentiment towards mandating vaccines is negative. It stems from the fact that people commenting on YouTube find the idea of mandating something a violation of freedom and the Constitution. As a result, any video that attempted to justify the vaccine mandate received more negative comments, whereas videos that argued against vaccine mandates received more positive comments.
Reddit is a social media website with a massive collection of online forums called subreddits, each covering a different topic. For example, for people who like tennis, there is a tennis subreddit with almost 770,000 members. On each subreddit you can post an image, a video, a GIF, a discussion, or anything else related to that subreddit's topic. People who are affiliated with or interested in a subreddit can join it and become members.
Subreddits Used
To answer the question of whether people's views on the COVID vaccine relate to their political affiliations, various political subreddits were analyzed based on their members' comments. We used the Reddit API and various Reddit scrapers to collect comments from each subreddit, spanning from two years ago to today, as long as they contained a few key words such as "covid vaccine" or "vaccine". We wanted a large sample of what various users from each subreddit believed and their stances on COVID vaccines. If someone was a part of a subreddit, we assumed they were affiliated with and shared some of the ideology of that subreddit. For example, someone in the Republican subreddit is assumed to share the party's views or be a part of the party.
After collecting the comments from the various subreddits, we ran them through a preprocessing stage in which stop words and punctuation were removed. Then sentiment analysis was applied to each comment. We associated negative comments with being against COVID vaccines and positive comments with being for them. We took the average score for each subreddit to see whether its comments about COVID vaccines were, on balance, negative or positive.
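The preprocessing step can be sketched as follows; the stop-word set here is a small illustrative subset (the real pipeline would use a fuller list, such as NLTK's):

```python
import string

# Small illustrative stop-word set; a real pipeline would use a fuller list.
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "i", "it"}

def preprocess(comment: str) -> str:
    """Lowercase the comment, strip punctuation, and drop stop words."""
    comment = comment.lower()
    comment = comment.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in comment.split() if w not in STOP_WORDS)

print(preprocess("The vaccine is safe, and I trust it!"))  # prints "vaccine safe trust"
```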
After each subreddit was analyzed, we made a word cloud for the most important subreddits that our project focused on, to see whether each political affiliation had a stance on COVID vaccines.
We also created a graph to display the average comment score for each subreddit, in order to compare their sentiment. Most of the subreddits concern political parties or politics more broadly, so their views could range from positive to negative. Plotting the average compound score for each subreddit gives the result below: the Liberal subreddit was the most negative towards COVID vaccines, and the UC Davis subreddit was the most positive. It is surprising that liberals, who are supposed to be more open-minded and progressive, had more negative comments about vaccines. Additionally, we can see that Democrats were more neutral on the vaccines compared to the other subreddits.
bar_graph_dict = {'Subreddit':[], 'Compound Score Average':[]}
bar_graph = pd.DataFrame(bar_graph_dict)
for i in subreddits:
bar_graph.loc[len(bar_graph.index)] = [i, round(df[i]["compound"].mean(), 4)]
bar_graph
(ggplot(bar_graph, aes(x="Subreddit", y="Compound Score Average", fill="Subreddit"))
 + geom_bar(stat="identity", color="black")
 + coord_flip()
 + theme_light()
 + xlab("Subreddits")
 + ylab("Average Compound Scores (Range: -1 to 1)")
 + ggtitle("Subreddit View on Covid Vaccine")
 + geom_text(aes(label="Compound Score Average"), position=position_stack(vjust=0.5)))
<ggplot: (105984530231)>
wc = WordCloud(width=1000, height=1000, random_state=42, stopwords=STOPWORDS).generate_from_frequencies(data_1[1])
plt.figure(figsize=(10,15))
plt.imshow(wc, interpolation="bilinear")
plt.axis('off')
plt.title("Subreddit: Republicans", fontsize=20)
plt.show()
The Republican Party in the United States is known for its conservative, right-wing views: less progressive, more traditional, and favoring fewer changes. Many Republicans have slowly become more entrenched with anti-vaccine activists and are more prone to oppose the COVID vaccine. This is backed by the fact that, on average, the subreddit's comments are negative. The Republican subreddit's comments also contained misinformation; for example, a widespread rumor claimed that graphene oxide was in the COVID vaccines and was poisonous. This is in line with the party, as more Republicans are unvaccinated than Democrats, and it explains why the word cloud contains these "poison" chemical words used to slander the vaccine.
wc = WordCloud(width=1000, height=1000, random_state=42, stopwords=STOPWORDS).generate_from_frequencies(data_1[3])
plt.figure(figsize=(10,15))
plt.imshow(wc, interpolation="bilinear")
plt.axis('off')
plt.title("Subreddit: Democrats", fontsize=20)
plt.show()
The Democratic Party is the biggest political party in the United States and is known for its liberal, left-wing views: more progressive, for change, and seeking greater social and economic equality. We believed that since Democrats favor change for social good, they would be for a COVID vaccine. We saw that, on average, the comments were neutral, representing neither support for nor opposition to the vaccines. This is backed by the word cloud, as many of the words simply discuss vaccines rather than expressing anything strongly positive about them.
wc = WordCloud(width=1000, height=1000, random_state=42, stopwords=STOPWORDS).generate_from_frequencies(data_1[5])
plt.figure(figsize=(10,15))
plt.imshow(wc, interpolation="bilinear")
plt.axis('off')
plt.title("Subreddit: Liberals", fontsize=20)
plt.show()
Liberals are known as left-wing, progressive, and more for change; they seek to create social good and incorporate new ideas. When their average compound score was calculated, it was, surprisingly, the most negative, suggesting a negative stance towards COVID vaccines.
A word cloud was created to determine whether the negative comments actually reflected a view against COVID vaccines. Looking at the word cloud, the most used words were impeach, clorox, impeached, adviser, and hydroxychloroquine. The adviser and impeach portions presumably refer to the Chief Medical Advisor to the President of the United States, whose stance was for social distancing and getting the COVID vaccine as soon as possible. Refusing and clorox were also among the most used words, suggesting that commenters on the Liberal subreddit were more negative about vaccines and may have believed them to be dangerous. Although liberals generally favor change and the social good the vaccine could bring by helping people stay safe from COVID, the data shows otherwise.
wc = WordCloud(width=1000, height=1000, random_state=42, stopwords=STOPWORDS).generate_from_frequencies(data_1[7])
plt.figure(figsize=(10,15))
plt.imshow(wc, interpolation="bilinear")
plt.axis('off')
plt.title("Subreddit: UC Davis", fontsize=20)
plt.show()
After calculating the average compound score for each subreddit, we saw that UC Davis had the highest positive score. As it is a college subreddit, the assumption is that college students are more educated and more likely to be for a vaccine that is backed by science and could potentially save lives. Additionally, college students in general lean more liberal than conservative, which should align with being for vaccines. [10] To determine whether UC Davis students were for or against the vaccines, we looked through the word cloud to see which words were used.
As it is a college subreddit, it mostly discusses school-related topics such as calculus, midterms, and professors. However, beneficial was used often in COVID vaccine comments, along with discussion of online school while waiting for the vaccines. We can say that UC Davis is for the vaccine, which makes sense, as students are more educated about vaccines and the benefits they could entail.
Overall, we can see that subreddits were divided by affiliation on whether they were for or against the vaccine. For example, the Republican subreddit was against the vaccines and talked about how they were poisonous. This is most likely caused by party beliefs and views: whether a party centers more around the left or the right predicts being for or against the vaccine, respectively. However, this is not always the case, as seen with the Liberal subreddit, which had the most negative average compound score and showed some stance against the vaccines. We also have to consider the bias that this is just a small portion of each party, as the subreddits do not account for every person who holds that party's beliefs.
When considering how a person responds to the pandemic, whether it be wearing masks in public or getting the COVID vaccine, we were interested in how one's political affiliation may play a role in willingness to adhere to state mandates. After looking at the COVID-related posts of several subreddits (specific communities on Reddit), there seems to be a difference in sentiment towards the pandemic between Republicans and Democrats. To get a comprehensive understanding of the differences in response, we analyzed data from throughout the entire pandemic in addition to recent statistics.
Looking at state data using several APIs, we scraped the information into dataframes containing the relevant fields. The CDC provides an extensive number of datasets that can be retrieved from its Open Data Source API. For the number of cases specifically, we used the Covid Act Now API to obtain the time series for two of the largest states in the United States with polarized political values, California ($\color{blue}{\text{Dem}}$) and Texas ($\color{red}{\text{Rep}}$).
To evaluate the relationship between vaccination rate and political affiliation, we referred to the CDC's data on vaccine hesitancy for counties in America along with the majority political party identification of each state from Wikipedia's Political party strength in U.S. states page.
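Joining these two sources amounts to a pandas merge on the state name; the toy values and column names below are assumptions for illustration:

```python
import pandas as pd

# Toy stand-ins for the CDC hesitancy data (averaged to states) and the
# Wikipedia party-identification table; values are made up for illustration.
hesitancy = pd.DataFrame({
    "state": ["California", "Texas"],
    "Mean Estimated Hesitancy": [0.12, 0.24],
})
party = pd.DataFrame({
    "state": ["California", "Texas"],
    "Party identification": ["Democratic", "Republican"],
})
final_df = hesitancy.merge(party, on="state")
```

The merged frame then has one row per state, carrying both the hesitancy estimate and the party label used to color later plots.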
We use a line plot to examine the incidence rate over the course of the pandemic (the change in growth rate at each point in time is the slope of the curve). Line plots are optimal for displaying time series data, clearly showing changes in the growth of case rates each quarter starting from January 2020.
sns.lineplot(x=joint_df.date, y=joint_df["Cases Per 100K"], hue=joint_df["state"],
palette=["blue", 'red']).set_title("Incidence Rate Over Time").get_figure().autofmt_xdate()
print(stat, p) # return t-stat of case rates
-2.3070487905818493 0.02118490789324234
We perform a Student's t-test to see if there is a true difference in mean incidence rates between the two states. The negative t-statistic, -2.307, indicates that California had a lower mean incidence rate over the course of the pandemic. We conclude that the two states did not have the same mean incidence rate, because our p-value of 0.021 is significant at the 0.05 level.
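The test itself is a one-liner with SciPy. Here is a sketch on synthetic incidence data (the real inputs were the Covid Act Now time series for the two states; the means and spreads below are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic daily cases-per-100K series standing in for the real state data.
ca = rng.normal(30, 10, 200)   # California-like series, lower mean
tx = rng.normal(35, 12, 200)   # Texas-like series, higher mean

stat, p = stats.ttest_ind(ca, tx)
# A negative statistic means the first sample (ca) has the lower mean.
```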
The CDC provided information on county hesitancy rates based on Household Pulse Survey responses. A response of "1" corresponds to "definitely getting the vaccine" and a response of "5" means "definitely not getting the vaccine." We take the average hesitancy over the counties of each state using pandas' groupby.
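That aggregation step looks roughly like this (toy county values; the actual column names in the CDC dataset may differ):

```python
import pandas as pd

# Toy county-level hesitancy estimates; the real data came from the CDC's
# Household Pulse Survey-based county dataset.
counties = pd.DataFrame({
    "state": ["CA", "CA", "TX", "TX"],
    "estimated_hesitant": [0.10, 0.14, 0.22, 0.26],
})
# One mean hesitancy value per state.
state_hesitancy = counties.groupby("state")["estimated_hesitant"].mean()
```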
The COVID Data Tracker from the CDC also establishes the number of doses delivered for each state.
Wikipedia includes several tables highlighting different political strengths for states in the U.S. We specifically chose to look at the political party identification majority in each state.
sns.scatterplot(x = final_df['Mean Estimated Hesitancy'], y = final_df['Doses Delivered per 100K'],
hue=final_df['Party identification'],
palette=['red', 'blue', 'gray']).set_title("Vaccine Rate vs. Hesitancy by Party")
Text(0.5, 1.0, 'Vaccine Rate vs. Hesitancy by Party')
corr
-0.7792915361789933
As expected, states with higher hesitancy also had fewer doses delivered. There is a strong negative correlation (r ≈ -0.78) between hesitancy and the number of vaccine doses delivered. Squaring this gives r² ≈ 0.61, so about 61% of the variation in the number of doses delivered is explained by vaccine hesitancy.
corr
0.5909235276025876
There is a moderate positive correlation (r ≈ 0.59) between political affiliation and mean estimated hesitancy: Republican-leaning states generally had higher hesitancy towards getting the COVID vaccine. Squaring gives r² ≈ 0.35, so about 35% of the variation in hesitancy is explained by the state's majority registered political party.
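Since a Pearson correlation needs numeric inputs, the party labels have to be encoded as numbers first. One possible encoding (an assumption for illustration, not necessarily what the report used) is:

```python
import pandas as pd

# Hypothetical 0/1 encoding of party labels so r can be computed; the toy
# hesitancy values are made up for illustration.
toy = pd.DataFrame({
    "Party identification": ["Democratic", "Democratic", "Republican", "Republican"],
    "Mean Estimated Hesitancy": [0.10, 0.12, 0.20, 0.24],
})
toy["party_num"] = toy["Party identification"].map({"Democratic": 0, "Republican": 1})
corr = toy["party_num"].corr(toy["Mean Estimated Hesitancy"])
```

With any such binary encoding, the Pearson r is equivalent to a point-biserial correlation between party and hesitancy.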
We now consider the incidence rates as of March 5 for all states.
sns.scatterplot(x = cases_final['Mean Estimated Hesitancy'], y = cases_final['Cases Per 100K'],
hue=cases_final['Party identification'],
palette=['red', 'blue', 'gray']).set_title("Incidence Rates vs. Hesitancy by Party")
Text(0.5, 1.0, 'Incidence Rates vs. Hesitancy by Party')
corr
0.581533647025002
There is a positive correlation (r ≈ 0.58) between vaccine hesitancy and incidence rates for US states; r² ≈ 0.34, so about 34% of the variation in incidence rates is explained by mean vaccine hesitancy.
corr
0.5043996947529066
When considering more states, there is still a positive correlation (r ≈ 0.50) between incidence rates and political party; r² ≈ 0.25, so about 25% of the variation in incidence rates is explained by political party.
Overall, there appears to be an association between political affiliation and response to COVID measures. Texas had a higher mean incidence rate throughout the course of the pandemic (Jan 2020–now) than California. Additionally, political party seems to correlate with incidence rates in recent numbers. Political party's association with vaccine hesitancy explains the difference in the number of vaccine doses delivered in Republican and Democratic states. If the COVID vaccine is indeed effective, the lack of doses delivered also plays a role in states' incidence rates.
On May 8, 2020, United Nations Secretary-General Antonio Guterres said that “the pandemic continues to unleash a tsunami of hate and xenophobia, scapegoating and scare-mongering” and urged governments to “act now to strengthen the immunity of our societies against the virus of hate.” In this part we want to visualize the relationship between COVID-19 and discrimination towards Asians.
We quantified discrimination towards Asians as the number of articles that included the key word "anti-Asian" in each month from 2018 to 2022, with the 2018 and 2019 data serving as a reference. We then quantified the impact of COVID-19 using the number of COVID-19-caused deaths and the number of COVID cases in each month since January 2020, when COVID was first reported in the US.
We used the NYTimes API (https://developer.nytimes.com/apis) to search for articles that included the word "anti-Asian"; each request returns at most 10 results for a given page. We therefore wrote a function that continually requests successive pages and stops when a page returns no results. This function counts the number of results on each page for a given month, e.g. 20190101 to 20190201, and sums them to get the total number of anti-Asian articles for that month.
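The pagination logic can be sketched as below; `fetch_page` stands in for one Article Search request (the real call would hit the NYTimes endpoint with the query term and the month's begin_date/end_date), so the helper and fake pages here are illustrative:

```python
def count_articles(fetch_page):
    """Sum result counts across pages, stopping at the first empty page."""
    total, page = 0, 0
    while True:
        docs = fetch_page(page)
        if not docs:
            break
        total += len(docs)
        page += 1
    return total

# Fake pages mimicking the API: two full pages of 10 results, then 3, then none.
pages = [list(range(10)), list(range(10)), list(range(3)), []]
n = count_articles(lambda p: pages[p])  # n == 23
```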
We obtained the number of deaths and cases from the NYTimes official GitHub repository (https://github.com/nytimes/covid-19-data).
We then combined these two datasets in a plot to visualize their relationship.
import matplotlib.pyplot as plt
# set up the size of the plot
plt.rc('font', size=12)
fig, ax = plt.subplots(figsize=(18, 8))
ax2 = ax.twinx()
# Specify how our lines should look: x-axis = date, y-axis_1 = number of anti-Asian articles, y-axis_2 = COVID-19 cases
ax.plot(total.begin_date, total.num_article, color='tab:orange', label = 'Number of Anti-Asian Articles per month')
ax2.plot(total.begin_date, montly_case, color='tab:blue', label = 'Covid cases per month')
ax.axvline(x='20200201', color='red', linestyle='--', label = 'COVID cases first reported in the US')
# label x-axis, y-axis, and title
ax.set_xlabel('Date')
ax.set_ylabel('Number of Anti-Asian Articles')
ax2.set_ylabel('Covid cases')
# rotate the x-axis labels by 45 degrees
ax.tick_params(axis='x', labelrotation = 45)
ax.legend(bbox_to_anchor=(0.304, 0.95))
ax2.legend(bbox_to_anchor=(0.2, 1))
ax.set_title("Number of Anti-Asian Articles vs Covid Cases")
Text(0.5, 1.0, 'Number of Anti-Asian Articles vs Covid Cases')
# set up the size of the plot
plt.rc('font', size=12)
fig, ax = plt.subplots(figsize=(18, 8))
ax2 = ax.twinx()
# Specify how our lines should look: x-axis = date, y-axis_1 = number of anti-Asian articles, y-axis_2 = COVID-19 deaths
ax.plot(total.begin_date, total.num_article, color='tab:orange', label = 'Number of Anti-Asian Articles per month')
ax2.plot(total.begin_date, montly_death, color='tab:blue', label = 'Covid death per month')
ax.axvline(x='20200201', color='red', linestyle='--', label = 'COVID cases first reported in the US')
# label x-axis, y-axis, and title
ax.set_xlabel('Date')
ax.set_ylabel('Number of Anti-Asian Articles')
ax2.set_ylabel('Covid death')
# rotate the x-axis labels by 45 degrees
ax.tick_params(axis='x', labelrotation = 45)
ax.legend(bbox_to_anchor=(0.304, 0.95))
ax2.legend(bbox_to_anchor=(0.2, 1))
ax.set_title("Number of Anti-Asian Articles vs Covid Death")
Text(0.5, 1.0, 'Number of Anti-Asian Articles vs Covid Death')
The number of articles including the word "anti-Asian" clearly increased after the start of the COVID-19 pandemic. There is no clear relationship between the number of COVID-19 cases per month and the number of anti-Asian articles. However, there is a very significant increase in the number of anti-Asian articles following each peak in COVID-19-caused deaths (202003 and 202011), which indicates an increase in discrimination towards Asians after periods with large numbers of COVID-caused deaths. Based on these features, we conclude that COVID-19 does appear to increase discrimination towards Asians.
Overall, the sentiment on vaccine mandates on YouTube was negative, driven by perceived conflicts with the Constitution and with freedom. It is important to note, however, that we only included the most-viewed videos relevant to vaccine mandates; the overall sentiment might differ if more videos were included, but we were unable to do so due to time constraints and the length of the report.
The subreddits that were chosen and analyzed were based on their politics or political party. We were able to determine that Reddit was divided by affiliation and held prior beliefs about vaccines: left-wing subreddits were more for the vaccine, except for the Liberal subreddit, and right-wing subreddits were against it. However, we should spend more time thoroughly checking more comments and subreddits, to get a bigger sample size for determining whether the general consensus was for or against vaccines.
Additionally, it is also worth noting that social media platforms are not a reliable source, so the overall sentiment of a particular platform should not be extrapolated to conclude the general population’s view.
When considering one’s background in their response to COVID measures, we discovered a correlation between political affiliation and willingness to get vaccinated. In the narrow scope of the biggest state for each party (California vs. Texas), we found that the mean incidence rates over the course of the pandemic were different: Texas had a significantly higher mean rate. After considering more states, there were noticeable associations between a state's majority political identification and the mean vaccine hesitancy of its counties. Additionally, there was a positive correlation between incidence rates and hesitancy; thus, we can conclude that higher COVID vaccination rates are correlated with lower numbers of cases. Looking at political affiliation and mean hesitancy reveals which states (and specifically which counties) most need encouragement to take the COVID vaccine in order to lower their incidence rates. As a future consideration, we could plot the vaccination rate over time for each state, since each state has a major political party lean, allowing more comparison and discussion.
According to our visualization for question 4, there was increasing discrimination towards the Asian community during COVID-19. Furthermore, an increase in COVID-19-related deaths correlates with an increase in the number of articles related to anti-Asian reports, which we use to quantify discrimination towards Asians. Thus, related policy should be made to combat anti-Asian bias and violence, and extra funding should be assigned to better protect the Asian community during the COVID-19 pandemic. For some future considerations, we could produce similar plots for different countries to better understand how COVID-19 affected discrimination against Asians elsewhere.
from googleapiclient.discovery import build
from textblob import TextBlob
import nltk
from wordcloud import WordCloud
from PIL import Image
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import STOPWORDS
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from IPython.display import Image
def get_comments(vid_id):
"""
Takes as input a youtube video id as a string and returns the top 500 comments in a dataframe
---
Note this code was inspired from
https://www.geeksforgeeks.org/how-to-extract-youtube-comments-using-youtube-api-python/ (see reference).
---
"""
# create list for comments
comments=[]
# create the youtube object to use api
api_key=''
youtube= build('youtube','v3',developerKey=api_key)
# obtain video results
data=youtube.commentThreads().list(
part='snippet,replies',
videoId=vid_id,
maxResults=100).execute()
num_iterations=1 # counter variable to limit results to top 500 comments
while data:
for items in data['items']:
# get comments and add to list
comment=items['snippet']['topLevelComment']['snippet']['textDisplay']
comments.append(comment)
# move to the next set of 100 comments until top 500 are obtained
if 'nextPageToken' in data and num_iterations <5:
data=youtube.commentThreads().list(
part='snippet,replies',
videoId=vid_id,
pageToken=data['nextPageToken'],
maxResults=100).execute()
num_iterations+=1
else:
break
# return top 500 comments in a dataframe
df=pd.DataFrame(comments,columns=["comments"])
return(df)
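The nextPageToken loop in get_comments can be sketched against a hypothetical in-memory paged source (fake_pages and fetch_all are illustrative names, not part of the YouTube client):

```python
# Minimal sketch of the nextPageToken pagination pattern used in get_comments.
# fake_pages stands in for the YouTube API: each "response" carries its items
# and, except on the last page, a token pointing at the next page.
fake_pages = {
    None: {"items": [1, 2], "nextPageToken": "p2"},
    "p2": {"items": [3, 4], "nextPageToken": "p3"},
    "p3": {"items": [5]},  # last page: no nextPageToken
}

def fetch_all(max_pages=5):
    results = []
    num_iterations = 0
    data = fake_pages[None]  # first request has no page token
    while data:
        results.extend(data["items"])
        num_iterations += 1
        # keep paging while a token exists and the page cap is not reached
        if "nextPageToken" in data and num_iterations < max_pages:
            data = fake_pages[data["nextPageToken"]]
        else:
            break
    return results
```

With a 100-comment page size, capping the loop at 5 iterations is what limits get_comments to roughly the top 500 comments.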
stopwords = nltk.corpus.stopwords.words("english")
# add more words to remove
stopwords.extend(["vaccine","vaccinate","vaccinated","vaccination","covid","vaccines","br","quot","people",
"know","one","want","take","virus","getting","got","nurse","cnn","mandate"])
from nltk.corpus import wordnet # needed for the WordNet POS constants below
def wordnet_pos(tag):
"""Map a Penn Treebank/Brown POS tag to a WordNet POS tag."""
table = {"N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV, "J": wordnet.ADJ}
# Default to a noun when the tag's first letter is not in the table.
return table.get(tag[0], wordnet.NOUN)
def stop_words_remover(sentence):
"""Removes stop words from sentences and returns back the filtered sentence as a string"""
return " ".join([i for i in nltk.word_tokenize(sentence) if i not in stopwords])
def remove_punctuations(sentence):
"""Removes any non alphanumeric words/characters from sentences"""
return " ".join([i for i in nltk.word_tokenize(sentence) if i.isalnum()])
def to_lower(sentence):
"""Lower case all words in a sentence """
return " ".join([i.lower() for i in nltk.word_tokenize(sentence)])
lemmatizer = nltk.WordNetLemmatizer()
def lemmatize(sentence):
"""Lemmatize all words in a sentence by returning the base form"""
return " ".join([lemmatizer.lemmatize(i) for i in nltk.word_tokenize(sentence)])
def preprocess(df,col):
"""Preprocesses sentences by lower casing, removing stopwords and all non alphanumeric words/charaters."""
df['processed_comments']=df[col].apply(to_lower).apply(stop_words_remover).apply(remove_punctuations)
return df
def separator_words(sentence):
"""
Separate the words first and then take out the stop words from it, input is a sentence.
"""
lemmatizer = nltk.WordNetLemmatizer()
#lower case the words first and then separates the sentence
tense_word = [i.lower() for i in nltk.word_tokenize(sentence)]
#tagging the word with nltk
tense_word = nltk.pos_tag(tense_word)
tense_word = [(word, wordnet_pos(t)) for (word, t) in tense_word]
#takes stop words out
tense_wordsv2 = [lemmatizer.lemmatize(word, t) for (word, t) in tense_word]
return " ".join([w for w in tense_wordsv2 if w not in stopwords])
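As a quick illustration of what the preprocessing steps above do, here is an nltk-free sketch using plain string methods (STOP is a tiny illustrative stop list, not the full nltk corpus, and preprocess_sentence is our own name for the combined steps):

```python
# Simplified version of the pipeline: lower-case, strip punctuation,
# then drop stop words. The real runs use nltk's tokenizer and stopword list.
STOP = {"the", "a", "is", "to", "and"}  # illustrative only

def preprocess_sentence(sentence):
    words = sentence.lower().split()
    # keep only alphanumeric characters within each token
    words = ["".join(ch for ch in w if ch.isalnum()) for w in words]
    # drop empty tokens and stop words
    return " ".join(w for w in words if w and w not in STOP)

preprocess_sentence("The vaccine IS safe, and effective!")  # "vaccine safe effective"
```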
def text_blob_sentiment(df,col):
"""
Takes a dataframe and a referenced column to compute the polarity and subjectivity scores.
Returns a dataframe with the polarity and subjectivity scores appended.
"""
# create textblob object and compute polarity and subjectivity scores for each sentence
blob_objects=df[col].apply(TextBlob)
sentiment_scores=[{"polarity":i.sentiment.polarity,
"subjectivity":i.sentiment.subjectivity
}for i in blob_objects.values]
# return dataframe with appended subjectivity and polarity scores
sentiment_scores_df=pd.DataFrame(sentiment_scores)
sentiment_df=pd.concat([df,sentiment_scores_df],axis=1)
return(sentiment_df)
def vader_sentiment(df,col):
"""
Takes as input a dataframe and a referenced column to compute the sentiment scores using nltk vader.
Note: the inputted dataframe is modified, so the output does not need to be referenced by a new variable.
"""
# create instance of vader sentiment analyzer and compute sentiment scores
sia=SentimentIntensityAnalyzer()
sentiment_scores=df[col].apply(sia.polarity_scores)
# append scores to dataframe
df['negative']=sentiment_scores.apply(lambda neg:neg["neg"]) #[2]
df['neutral']=sentiment_scores.apply(lambda neu:neu["neu"]) #[2]
df['positive']=sentiment_scores.apply(lambda pos:pos["pos"]) #[2]
df['compound']=sentiment_scores.apply(lambda comp:comp["compound"]) #[2]
return df
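VADER's compound score falls in [-1, 1]; a commonly used convention from the VADER authors classifies comments with cutoffs at ±0.05. A small helper along those lines (label_compound is our own illustrative name, not part of nltk):

```python
def label_compound(score, threshold=0.05):
    # Conventional VADER cutoffs: >= +0.05 positive, <= -0.05 negative,
    # anything in between neutral.
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

label_compound(-0.9825)  # "negative", matching the example comment discussed earlier
```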
def generate_word_cloud(df,title,fig_size):
"""
Takes as input a dataframe and generates a word cloud for preprocessed comments.
Title and figure size are also required inputs for the purpose of customization
"""
# create instance of wordcloud
wc=WordCloud(width=1600,height=800,background_color='black',
stopwords=STOPWORDS).generate(" ".join(df.processed_comments))
# adjust for figure aesthetics
plt.figure(figsize=(14,10))
plt.imshow(wc,interpolation='bilinear')
plt.axis('off')
plt.title(title,size=fig_size)
# obtain comments and average polarity score for first video
vid_1=get_comments('7Qk3NyAV_Q4')
# text preprocessing
vid_1_preprocess=preprocess(vid_1,'comments')
# obtain vader sentiment scores and obtain the mean compound
vid_1_preprocess=vader_sentiment(vid_1_preprocess,'comments')
vid_1_avg_score=vid_1_preprocess.compound.mean()
# obtain comments and average polarity score for second video
vid_2=get_comments('rzUASFzT_2k')
# text preprocessing
vid_2_preprocess=preprocess(vid_2,'comments')
# obtain vader sentiment scores and obtain the mean compound
vid_2_preprocess=vader_sentiment(vid_2_preprocess,'comments')
vid_2_avg_score=vid_2_preprocess.compound.mean()
# obtain comments and average polarity score for third video
vid_3=get_comments('z8cYMT3ds4o')
# text preprocessing
vid_3_preprocess=preprocess(vid_3,'comments')
# obtain vader sentiment scores and obtain the mean compound
vid_3_preprocess=vader_sentiment(vid_3_preprocess,'comments')
vid_3_avg_score=vid_3_preprocess.compound.mean()
# obtain comments and average polarity score for fourth video
vid_4=get_comments('zgVo9i-xgwo')
# text preprocessing
vid_4_preprocess=preprocess(vid_4,'comments')
# obtain vader sentiment scores and obtain the mean compound
vid_4_preprocess=vader_sentiment(vid_4_preprocess,'comments')
vid_4_avg_score=vid_4_preprocess.compound.mean()
# obtain comments and average polarity score for fifth video
vid_5=get_comments('4HsgKF2hagQ')
# text preprocessing
vid_5_preprocess=preprocess(vid_5,'comments')
# obtain vader sentiment scores and obtain the mean compound
vid_5_preprocess=vader_sentiment(vid_5_preprocess,'comments')
vid_5_avg_score=vid_5_preprocess.compound.mean()
# obtain comments and average polarity score for sixth video
vid_6=get_comments('qI9kOw4K6is')
# text preprocessing
vid_6_preprocess=preprocess(vid_6,'comments')
# obtain vader sentiment scores and obtain the mean compound
vid_6_preprocess=vader_sentiment(vid_6_preprocess,'comments')
vid_6_avg_score=vid_6_preprocess.compound.mean()
vid_titles=["Majority of workers support Biden's business Covid vaccine mandate: CNBC Survey",
"‘Shameful’: Doctor on Supreme Court’s Covid-19 vaccine mandate decision",
"Why You Can Be Forced To Get The Covid Vaccine",
"Supreme Court Blocks Biden’s Covid Vaccine Mandate For Private Companies",
"Supreme Court blocks Biden's vaccine mandate for big businesses",
"Police officer quits over vaccine mandate: 'I decided to turn in my badge so I can speak up'"
]
average_score=[vid_1_avg_score,
vid_2_avg_score,
vid_3_avg_score,
vid_4_avg_score,
vid_5_avg_score,
vid_6_avg_score]
# summarise results of the average polarity score for each video in a dataframe
results_df=pd.DataFrame({"video_title":vid_titles,
"average_compound_score":average_score})
# image of sentiment analysis workflow
sentiment_workflow=Image(filename='sentiment_analysis_workflow.png',height=500,width=800)
import requests
import pandas as pd
import json
import numpy as np
import datetime
import csv
import praw
import json
from datetime import datetime
import glob
import os
from psaw import PushshiftAPI
import nltk
import nltk.corpus
from autocorrect import Speller
from textblob import TextBlob
from nltk.corpus import wordnet
from nltk.sentiment.vader import SentimentIntensityAnalyzer
api = PushshiftAPI()
import matplotlib.pyplot as plt
from plotnine import *
from wordcloud import WordCloud
from wordcloud import STOPWORDS
from sklearn.feature_extraction.text import TfidfVectorizer
#got it from reddit
CLIENT_ID = ""
SECRET_KEY = ""
auth = requests.auth.HTTPBasicAuth(CLIENT_ID, SECRET_KEY) #getting authorizations
#putting password in txt file instead of writing it down
f = open("password_reddit.txt","r")
lines=f.readlines()
#getting the password
password=lines[0]
f.close() # close the file (note the call parentheses)
#reddit API needs this in this format
data = {
'grant_type': 'password',
'username': 'randomlymixed',
'password': password
}
user_agent = "MyAPI 1.0"
#need the user_agent for later API usage
headers = {'User-Agent': "MyAPI/0.0.1"}
#reddit API
res = requests.post('https://www.reddit.com/api/v1/access_token',
auth=auth, data= data, headers=headers)
#getting the token to access reddit api
TOKEN = res.json()['access_token']
#using to get my authorization key
headers['Authorization'] = f'bearer {TOKEN}'
#testing if I have access with my credentials
requests.get('https://oauth.reddit.com/api/v1/me', headers = headers).json()
#this is for praw, another reddit scraper. Helps to scrape comments from reddit
reddit = praw.Reddit(
client_id = CLIENT_ID,
client_secret = SECRET_KEY,
user_agent = user_agent
)
def get_reddit_data(after, before, term, sub, limit): #pushshift is the scraper I used, more info at https://github.com/pushshift/api
''' Gets all reddit comments (plus some additional metadata) matching certain key terms
within the period between `after` and `before` days ago. The reddit information is returned as parsed JSON.
after: is a string in the format of "#d". Ex: 720d
before: is a string in the format of "#d". Ex: 500d
term: key words that we will search for. String format; if there are spaces, use dashes instead. Ex: "Covid-Vaccine"
sub: the subreddit name as a string. Ex: "AskReddit"
limit: how many comments to pull. Max limit is 100, in str format "#". Ex: "100"
'''
url = 'https://api.pushshift.io/reddit/search/comment/?q='+ str(term) + '&after=' +str(after)+ '&before='+str(before)+'&subreddit=' +str(sub) +'&size='+str(limit)
#gives us the url so we can go to the site ourselves to see what is on it
print(url)
#running to get what we need
res = requests.get(url)
#puts it in a dictionary for us
data = json.loads(res.text, strict = False)
return data['data']
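The hand-concatenated query string above can equivalently be built with urllib.parse.urlencode, which also takes care of escaping; build_pushshift_url is a hypothetical helper, not part of the project code:

```python
from urllib.parse import urlencode

def build_pushshift_url(term, after, before, sub, limit):
    # Same query that get_reddit_data concatenates by hand,
    # assembled from a parameter dict instead.
    params = {"q": term, "after": after, "before": before,
              "subreddit": sub, "size": limit}
    return "https://api.pushshift.io/reddit/search/comment/?" + urlencode(params)

build_pushshift_url("Covid-Vaccine", "720d", "540d", "AskReddit", "100")
```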
df_comments = pd.DataFrame() #initializing a dataframe outside so I can have a copy of it
def comment_reddit(after, before, term, subreddit, limit):
'''
Stores all the comments scraped by get_reddit_data in a dataframe. Returns it as a pd.DataFrame().
after: is a string in the format of "#d". Ex: 720d
before: is a string in the format of "#d". Ex: 500d
term: key words that we will search for. String format; if there are spaces, use dashes instead. Ex: "Covid-Vaccine"
subreddit: the subreddit name as a string. Ex: "AskReddit"
limit: how many comments to pull. Max limit is 100, in str format "#". Ex: "100"
'''
# creates an empty DataFrame
df_comments = pd.DataFrame()
#uses previous function to get dictionary of information
data = get_reddit_data(str(after), str(before), str(term), str(subreddit), str(limit)) #using other function
#for each item in dictionary, we will get the subreddit, score of the comment, and when it was created
#putting it in the data frame that was previously made
for submission in data:
df_comments = df_comments.append({'subreddit': submission['subreddit'],
'score': submission['score'],
'comments': submission['body'],
'time_created': datetime.utcfromtimestamp(submission['created_utc']).strftime('%Y-%m-%d %H:%M:%S')
}, ignore_index = True)
return(df_comments)
def data_collection_reddit(list_name, subreddit, folder):
'''
Writes each scraped comment dataframe out to a CSV file for storage.
list_name: a list of key terms that should be searched on. List should be strings. Ex: ["vaccine","anti-vaccine", "COVID-vaccine"]
subreddit: what subreddit we want to search the terms on. In str format. Ex: "Republican"
folder: the folder (inside the working directory) where the CSVs are saved. In str format.
#to name the file easier
counter = 1
#We want to go back about 2 years, to when the pandemic first started, hence
#720 days ago, and we work in 6-month windows, so here we only want comments between 720 and 540 days ago.
#We put each batch in a dataframe; if the dataframe is empty (no comments from that time period), we continue,
#but if there is data from that time period, we write the dataframe to a CSV for easier storage.
for i in list_name:
df1 = pd.DataFrame()
for j in range(720, 540, -20): #in increments of 20 days
try:
df1 = comment_reddit(str(j)+"d", "540d", i, str(subreddit), '400')
if df1.empty:
continue
else:
df1.to_csv(os.path.join(os.getcwd()+"\\"+str(folder))+"\\03_11_2020-09_07_2020, Subreddit-" + str(subreddit) + str(counter) +".csv", index=False)
counter += 1
except:
continue
#We want it in increments of 6 months so we will only want comments between 540 and 360 days ago
#then we put it in a dataframe, and if the dataframe is empty hence no comments from that time period, we continue
#but if there is data from that time period, we will make the dataframe into a csv to store it easier
for i in list_name:
df1 = pd.DataFrame()
for j in range(540, 360, -20):
try:
df1 = comment_reddit(str(j)+"d", "360d", i, str(subreddit), '400')
if df1.empty:
continue
else:
df1.to_csv(os.path.join(os.getcwd()+"\\"+str(folder))+"\\09_11_2020-03_06_2021, Subreddit-" + str(subreddit) + str(counter) +".csv", index=False)
counter += 1
except:
continue
#We want it in increments of 6 months so we will only want comments between 360 and 180 days ago
#then we put it in a dataframe, and if the dataframe is empty hence no comments from that time period, we continue
#but if there is data from that time period, we will make the dataframe into a csv to store it easier
for i in list_name:
df1 = pd.DataFrame()
for j in range(360, 180, -15):
try:
df1 = comment_reddit(str(j)+"d", "180d", i, str(subreddit), '400')
if df1.empty:
continue
else:
df1.to_csv(os.path.join(os.getcwd()+"\\"+str(folder))+"\\03_06_2021-09_02_2021, Subreddit-" + str(subreddit) + str(counter) +".csv", index=False)
counter += 1
except:
continue
#We want it in increments of 6 months so we will only want comments between 180 and 0 days ago
#then we put it in a dataframe, and if the dataframe is empty hence no comments from that time period, we continue
#but if there is data from that time period, we will make the dataframe into a csv to store it easier
for i in list_name:
df1 = pd.DataFrame()
for j in range(180, 0, -9):
try:
df1 = comment_reddit(str(j)+"d", "0d", i, str(subreddit), '400')
if df1.empty:
continue
else:
df1.to_csv(os.path.join(os.getcwd()+"\\"+str(folder))+"\\09_02_2021-03_01_2022, Subreddit-" + str(subreddit) + str(counter) +".csv", index=False)
counter += 1
except:
continue
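The relative "Nd" offsets above map to calendar dates once a reference date is fixed; with a reference date of 2022-03-01, the 720-to-540-day window lands on 2020-03-11 through 2020-09-07, which matches the CSV naming scheme used in this function. A small sketch (window_to_dates is an illustrative helper, not project code):

```python
from datetime import date, timedelta

def window_to_dates(after_days, before_days, today):
    # Translate Pushshift's relative "Nd" offsets into calendar dates,
    # using a fixed reference date so the result is reproducible.
    return (today - timedelta(days=after_days),
            today - timedelta(days=before_days))

start, end = window_to_dates(720, 540, date(2022, 3, 1))
# start = date(2020, 3, 11), end = date(2020, 9, 7)
```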
def all_csv_files(dates, subreddit, folder):
'''
Finds all the CSV filenames and returns them as a list of lists.
dates: list format. The date ranges used when collecting the data, i.e. the file naming scheme.
subreddit: which subreddit, part of the naming scheme for the data. Ex: "Republican"
folder: where the CSVs are located, relative to the working directory. Str format
'''
#find all the csvs' names and makes it into a list
csv_list = []
#based on the dates given
for i in dates:
csv_list.append(glob.glob(os.path.join(os.getcwd()+"\\"+str(folder), str(i) +", Subreddit-" + str(subreddit)+ "*.csv")))
#returns the list of csv names
return(csv_list)
df = {} #initializing a dict of dataframes so I can have a copy
def conv_csv_to_pd(sub):
'''
Collects all the CSVs we found, converts them back to dataframes, and combines them based on
the subreddit we chose.
sub: list format; should be the list of subreddits used. Ex: ["Republican", "Democrats","Conservatives"]
'''
#makes an empty dictionary
data_set ={}
for name in sub:
combined_data = []
#get all the CSV files for this subreddit
dataset = all_csv_files(["03_11_2020-09_07_2020","09_11_2020-03_06_2021","03_06_2021-09_02_2021","09_02_2021-03_01_2022",], str(name), "Data")
#in case of failure
try:
for i in range(0,4):
#adds all the dataframes together
data_ = pd.concat(map(pd.read_csv, dataset[i]), ignore_index=True)
#if it exists, no need and just replaced the dataset
if str(i) in data_set.keys():
data_set[str(i)] = data_.drop_duplicates()
#making a new dataframe each time for the new subreddit
else:
data_set[str(i)] = pd.DataFrame()
data_set[str(i)] = data_.drop_duplicates()
combined_data.append(data_set[str(i)])
combined_data1 = pd.concat(combined_data).drop_duplicates()
combined_data1.index = pd.RangeIndex(start=0, stop = combined_data1.shape[0], step= 1)
#if it exists, no need and just add dataframe to df
if name in df.keys():
df[name] = combined_data1
#making a new dataframe each time for the new subreddit
else:
df[name] = pd.DataFrame()
df[name] = combined_data1
except:
print("We have no data on that " + str(name) + " subreddit.")
return(df)
data_frame = {}
corpus =[]
def key_word_search(df,col,sub):
"""Purpose of key_word_search is to find the most important and frequent words used across all the comments,
using TF-IDF to grade whether a word is important.
It takes a dictionary of dataframes, a column name, and a list of the subreddits needed.
"""
data_frame = {}
corpus =[]
#use TF-IDF to grade each word on importance and frequency
vectorizer = TfidfVectorizer(stop_words='english', ngram_range = (1,1), max_df = .6, min_df = .01)
#joins the comments of each subreddit into one large giant text and makes its own individual element
for name in sub:
data_frame[name] = ' '.join(df[name][col])
corpus.append(data_frame[name])
#running the vectorizer on the giant lists of subreddit comments.
X = vectorizer.fit_transform(corpus)
feature_names = vectorizer.get_feature_names()
dense_matrix = X.todense()
#convert the dense matrix into a list
denselist = dense_matrix.tolist()
#put it in a dataframe and transpose it so each column corresponds to one subreddit
df_1s = pd.DataFrame(denselist, columns=feature_names)
dataset = df_1s.transpose()
return(dataset)
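To make the TF-IDF grading concrete, here is a bare-bones version of the score for a single term in a single document. It uses scikit-learn's default smoothed idf, ln((1+n)/(1+df)) + 1, but omits the L2 normalization TfidfVectorizer applies afterwards, so the numbers are only comparable in relative terms:

```python
import math

def tf_idf(term, doc, corpus):
    # tf: relative frequency of the term within this document
    words = doc.split()
    tf = words.count(term) / len(words)
    # idf: terms appearing in fewer documents get a larger weight
    df = sum(term in d.split() for d in corpus)
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1
    return tf * idf

corpus = ["vaccine mandate bad", "vaccine mandate good", "taxes too high"]
# "taxes" (in 1 of 3 docs) outweighs "vaccine" (in 2 of 3 docs) at equal tf
```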
for i in subreddits:
# going through the comments and processes by removing stopwords and punctuations
df[i]["processed comments"] = df[i]['comments'].apply(remove_punctuations)
df[i]["processed comments"] = df[i]["processed comments"].apply(separator_words)
#applies the vader sentiments
for i in subreddits:
df[i] = vader_sentiment(df[i],"comments")
#applies the textblob sentiment
for i in subreddits:
df[i] = text_blob_sentiment(df[i],"processed comments")
data_1 = key_word_search(df, "processed comments", ['CoronavirusCA',
'Republican',
'AskReddit',
'Democrats',
'socialism',
'Liberal',
'Conservative',
'moderatepolitics',
'UCDavis',
'Libertarian',
'Politics'])
#the call above runs our key_word_search to find the most important words and print them to a word cloud.
#this is getting 5 rows where TextBlob and Vader differ extensively
TextBlob_Inaccuraries = df["Republican"].iloc[[0,4,5,6,7,],:]
#resetting the index of the rows
TextBlob_Inaccuraries.index = pd.RangeIndex(start=0, stop = TextBlob_Inaccuraries.shape[0], step= 1)
import requests
import pandas as pd
import numpy as np
import lxml.html as lx
import seaborn as sns
sns.set_style("darkgrid")
sns.set(rc={'figure.figsize':(11.7,8.27)})
import matplotlib.pyplot as plt
import json
allstates = requests.get("https://api.covidactnow.org/v2/states.timeseries.json?apiKey={key}")
data = json.loads(allstates.text)
def pop_normalize(case, pop):
return case/pop*100000 # return incidence rate
# population size is a major influence on number of total cases, normalized the data using the states' populations.
ca_df = pd.json_normalize(data[4]) # normalize CA data from the list of all json
tx_df = pd.json_normalize(data[45]) # normalize TX data
ca_ts = pd.json_normalize(ca_df["actualsTimeseries"][0]) # select times series json from CA
tx_ts = pd.json_normalize(tx_df["actualsTimeseries"][0]) # select times series json from TX
ca_ts["Cases Per 100K"] = [pop_normalize(case, ca_df.population) for case in ca_ts["cases"]]
ca_ts["Cases Per 100K"] = [ca_ts["Cases Per 100K"][i][0] for i in range(len(ca_ts["Cases Per 100K"]))] # select numeric
# incidence rate
tx_ts["Cases Per 100K"] = [pop_normalize(case, tx_df.population) for case in tx_ts["cases"]]
tx_ts["Cases Per 100K"] = [tx_ts["Cases Per 100K"][i][0] for i in range(len(tx_ts["Cases Per 100K"]))]
ca_ts["date"] = pd.to_datetime(ca_ts["date"])
tx_ts["date"] = pd.to_datetime(tx_ts["date"])
ca_ts["state"] = "CA"
tx_ts["state"] = "TX"
joint_df = pd.concat([ca_ts, tx_ts], ignore_index=True) # stack both dataframes
joint_df # ignore NaN, irrelevant to the columns needed
from scipy.stats import ttest_ind
stat, p = ttest_ind(ca_ts["Cases Per 100K"].dropna(), tx_ts["Cases Per 100K"].dropna()) # drop rows with NaN for both states
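scipy's ttest_ind defaults to Student's two-sample t-test with pooled variance; the statistic it reports can be reproduced by hand (two_sample_t is an illustrative reimplementation, not project code):

```python
import math

def two_sample_t(x, y):
    # Student's two-sample t statistic with pooled variance,
    # as computed by scipy.stats.ttest_ind with equal_var=True (the default).
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)  # sample variances
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)  # pooled variance
    return (mx - my) / math.sqrt(sp2 * (1 / nx + 1 / ny))

two_sample_t([1, 2, 3, 4], [3, 4, 5, 6])  # negative: the first sample's mean is lower
```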
hesitancy = pd.read_json("https://data.cdc.gov/api/id/q9mh-h2tw.json")
#https://covid.cdc.gov/covid-data-tracker/#vaccinations_vacc-total-admin-rate-total
vaccination = pd.read_csv("C:\\Users\\chris\\Documents\\STA 141B\\covid19_vaccinations_in_the_united_states.csv")
vaccination.columns = vaccination.iloc[3,:] # column names are in the row at index 3 (the 4th row)
vaccination = vaccination.iloc[4:,:] # remove the blank/header rows above the data
vaccination['State/Territory/Federal Entity'] = [state.lower() for state in vaccination['State/Territory/Federal Entity']]
# lower state names for merging dfs
import urllib.request as urlrq
import certifi
import ssl
resp = urlrq.urlopen("https://en.wikipedia.org/wiki/Political_party_strength_in_U.S._states",
context=ssl.create_default_context(cafile=certifi.where()))
ssl._create_default_https_context = ssl._create_unverified_context # bypass certification requirement
lst= pd.read_html("https://en.wikipedia.org/wiki/Political_party_strength_in_U.S._states") # read in tables
def clean_party(states): # remove irrelevant details of party identification (numbers of party split)
if "Republican" in states:
return "Republican"
elif "Democratic" in states:
return "Democratic"
else:
return "Even"
df = pd.DataFrame(lst[5]) # make df out of table of interest (6th)
df = df.rename(columns={"State": "state", 'Party registration or identification (% as of 2020': "Party identification"})
df["Party identification"] = [clean_party(state) for state in df["Party identification"]]
df['state'] = [state.lower() for state in df['state']] # lower for df merging
hesitancy['state'] = [state.lower() for state in hesitancy['state']]
corr_df = hesitancy.merge(df, on='state') # combine the dfs on state name
corr_df
vaccination = vaccination.rename(columns={'State/Territory/Federal Entity':'state'})
corr_df = corr_df.merge(vaccination, on = 'state')
final = corr_df.groupby(['state', 'Party identification', 'Doses Delivered per 100K']).estimated_hesitant.mean()
# group by state and return the estimated mean hesitancy for each state
final_df = pd.DataFrame()
final_df['state'] = final.index.get_level_values(0) # get values from indices of groupby object
final_df['Party identification'] = final.index.get_level_values(1)
final_df['Doses Delivered per 100K'] = final.index.get_level_values(2)
final_df['Mean Estimated Hesitancy'] = final.values
final_df['Doses Delivered per 100K'] = [int(x) for x in final_df['Doses Delivered per 100K']] # change values into numbers
from scipy.stats import pearsonr
corr, _ = pearsonr(final_df["Mean Estimated Hesitancy"], final_df["Doses Delivered per 100K"]) #return pearson correlation
final_df["Binary Party"] = [1 if row == 'Republican' else 0 for row in final_df["Party identification"]]
# Create binary labels to perform correlation measure with political party
corr, _ = pearsonr(final_df["Mean Estimated Hesitancy"], final_df["Binary Party"])
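Correlating hesitancy with the 0/1 party label is the point-biserial correlation, which is just Pearson's r applied to a binary variable. The statistic pearsonr returns can be written out directly (pearson below is an illustrative reimplementation, not project code):

```python
import math

def pearson(x, y):
    # Sample Pearson correlation coefficient: covariance over
    # the product of the standard deviations.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

pearson([1, 2, 3], [2, 4, 6])  # 1.0: a perfectly linear relationship
```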
# Generated march 5
last7_df = pd.read_csv("C:\\Users\\chris\\Documents\\STA 141B\\united_states_covid19_cases_deaths_and_testing_by_state.csv")
last7_df.columns = last7_df.iloc[1,:] # column names are in 2nd row
last7_df = last7_df.iloc[1:,:] # remove blank row
last7_df = last7_df.rename(columns={'State/Territory': 'state'})
last7_df['state'] = [state.lower() for state in last7_df['state']]
cases_final = final_df.merge(last7_df, on='state')
cases_final["Case Rate per 100000"] = cases_final['Case Rate per 100000'].astype(int) # change case rate datatype into int
cases_final = cases_final.rename(columns={"Case Rate per 100000": "Cases Per 100K"})
corr, _ = pearsonr(cases_final["Mean Estimated Hesitancy"], cases_final["Cases Per 100K"])
corr, _ = pearsonr(cases_final["Binary Party"], cases_final["Cases Per 100K"])
import requests
import requests_cache # used by find_article_n's CachedSession
import time # used for rate-limit sleeps between requests
import pandas as pd
import numpy as np
# a function to find the total number of articles for a given month, paging through the results
def find_article_n(begin, end):
session = requests_cache.CachedSession('demo_cache')
con = True
i = 0
num_art = 0
while con == True:
sc = session.get("https://api.nytimes.com/svc/search/v2/articlesearch.json?", params = {
"q":"anti-asian",
"page":str(i),
"page-size":"50",
"begin_date":begin,
"end_date":end,
"api-key":"WFGQS8gKLAUWGvAnphJLOO2uN5MH4jyJ"
})
sc.raise_for_status()
if len(sc.json()['response']['docs']) != 0:
num_art += len(sc.json()['response']['docs'])
i += 1
else:
con = False
time.sleep(10)
return(num_art)
# We modified the for loop several times to get the exact total data from 201801 to 202201 and saved in the
# file all_year_articlecase and death.csv.
begin_date, end_date, num_art = [], [], [] # initialize the result lists before appending
for i in range(1):
begin = '20220'+str(i+1)+'01'
end = '20220'+str(i+2)+'01'
this_mon = find_article_n(begin, end)
begin_date.append(begin)
end_date.append(end)
num_art.append(this_mon)
time.sleep(10)
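Note that building the begin/end strings by concatenation ('20220' + str(i+1) + '01') only yields valid YYYYMMDD dates for single-digit months within a single year; datetime can generate the month boundaries robustly (month_starts is a hypothetical helper, not project code):

```python
from datetime import date

def month_starts(start_year, start_month, n_months):
    # First-of-month strings in the YYYYMMDD format the NYT API expects,
    # rolling over year boundaries correctly.
    out = []
    y, m = start_year, start_month
    for _ in range(n_months):
        out.append(date(y, m, 1).strftime("%Y%m%d"))
        m += 1
        if m > 12:
            y, m = y + 1, 1
    return out

month_starts(2021, 11, 4)  # ["20211101", "20211201", "20220101", "20220201"]
```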
total = pd.read_csv('/Users/pilehe/Desktop/sta 141b/homework/project/all_year_article.csv')
# set the date to be string
total['begin_date']= total['begin_date'].astype(str)
death_case = pd.read_csv('/Users/pilehe/Desktop/sta 141b/homework/project/case and death.csv')
monthly_case_2 = death_case['case']
monthly_death_2 = death_case['death']
# prepend 24 zeros to the start of the list, representing all months of 2018 and 2019
montly_case = [0]*24
montly_death = [0]*24
# add the following data to the monthly data
montly_case.extend(monthly_case_2)
montly_death.extend(monthly_death_2)
How to Extract YouTube Comments Using YouTube API – Python, https://www.geeksforgeeks.org/how-to-extract-youtube-comments-using-youtube-api-python/
Python Lambda, https://www.w3schools.com/python/python_lambda.asp
YouTube Data API: CommentThreads, https://developers.google.com/youtube/v3/docs/commentThreads